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Abstract 

Due to its cost effectiveness, next-generation sequencing of pools of individuals (Pool-Seq) is becoming a popular strategy 
for characterizing variation in population samples. Because Pool-Seq provides genome-wide SNP frequency data, it is 
possible to use them for demographic inference and/or the identification of selective sweeps. Here, we introduce 
a statistical method that is designed to detect selective sweeps from pooled data by accounting for statistical challenges 
associated with Pool-Seq, namely sequencing errors and random sampling among chromosomes. This allows for an 
efficient use of the information: all base calls are included in the analysis, but the higher credibility of regions with higher 
coverage and base calls with better quality scores is accounted for. Computer simulations show that our method efficiently 
detects sweeps even at very low coverage (0.5 x per chromosome). Indeed, the power of detecting sweeps is similar to 
what we could expect from sequences of individual chromosomes. Since the inference of selective sweeps is based on the 
allele frequency spectrum (AFS), we also provide a method to accurately estimate the AFS provided that the quality scores 
for the sequence reads are reliable. Applying our approach to Pool-Seq data from Drosophila melanogaster, we identify 
several selective sweep signatures on chromosome X that include some previously well-characterized sweeps like the wapl 
region. 

Key words: selective sweeps, next-generation sequencing, pooled DNA, Drosophila, allele frequency spectrum, hidden 
Markov model. 



Introduction 

The detection of genomic regions that have evolved under 
natural selection is an important topic in population genet- 
ics, which poses interesting theoretical challenges and 
holds great potential for medical and economic benefits. 
The case of hard sweeps, where a new mutant goes to fix- 
ation in a population due to strong directional selection, 
has received particular attention. Exploiting a typical pat- 
tern of reduced genetic diversity in the vicinity of the se- 
lected site, several methods were proposed to detect such 
events by screening the allele frequencies along the genome 
in a single population (Kim and Stephan 2002; Jensen et al. 
2005; Nielsen et al. 2005; Boitard et al. 2009) and were ap- 
plied to several species (Li and Stephan 2006; Williamson 
et al. 2007). 

Today, the advent of next-generation sequencing (NGS) 
technologies provides a new dimension to such genome 
scans for selection. Genomes can be covered with very high 
density, and the ascertainment bias caused by SNP identi- 
fication is becoming less important. Currently, the precise 
identification of individual genotypes, which requires a high 
sequencing coverage of each individual, remains very ex- 
pensive for large samples. However, hard sweeps can also 
be detected when using only information concerning the 
genetic diversity of the sample along the genome. This 



information can be obtained also from experiments where 
the DNA from a pool of individuals is sequenced simulta- 
neously (Pool-Seq). Although Pool-Seq is considerably 
cheaper than the sequencing of individuals, there are 
some methodological challenges associated with the anal- 
ysis of the resulting data. For a discussion, see Futschik and 
Schlotterer (2010). 

The new sequencing technologies have resulted in 
a dramatic cost reduction compared with classic Sanger 
sequencing, but the error rate per sequence is considerably 
higher. Even for diploid individuals, the distinction between 
sequencing errors and true SNPs is challenging when the 
coverage is not high enough. Similarly, for Pool-Seq, the 
identification of rare SNPs is difficult. One common strat- 
egy is thus to remove all singletons or doubletons from the 
analysis, because they might also result from sequencing 
errors. For the same reason, base calls with low-quality 
scores tend to be removed as well. Although it is possible 
to obtain unbiased estimates of genetic diversity using this 
approach, it is apparent that information is lost. In partic- 
ular, the detection of selective sweeps could be compro- 
mised by this strategy because low-frequency alleles are 
an important signal to detect recent selective sweeps. 

Hidden Markov models provide a natural framework to 
take both sequencing errors and unequal local coverage 
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into account. Here, we develop a Hidden Markov Model for 
detecting sweeps using pooled NGS data: This model ex- 
tends the one investigated in Boitard et al. (2009) for clas- 
sical sequencing data. As part of the model, we also 
introduce a version of the Expectation Maximization 
(EM) algorithm to estimate the allele frequency spectrum 
(AFS) using the information from all available genomic 
positions. Indeed, the estimated AFS is used to scan the 
genome for regions where the AFS is biased toward ex- 
treme allele frequencies. Our approach involves computing 
the likelihoods of the observed read information condi- 
tional on the number of derived alleles in the pool across 
genome positions. It takes into account the uncertainty 
concerning the true allele frequencies in the pool, which 
might typically be higher for sites with low-coverage or 
bad-quality scores. 

Using computer simulations, we study the accuracy of this 
procedure for estimating the AFS, and its power for detecting 
selective sweeps. We then apply our approach to scan the X 
chromosome of Drosophila melanogaster, using two pooled 
samples of 97 flies sequenced at 100x coverage. 



Schlotterer (2010). The sequencing error probabilities, e,j, 
can be deduced from the PHRED scores, Qy, provided with 
the sequenced data, using the relation e-,j = '\Q~ Q - i >l w . As ll- 
lumina PHRED scores are known to be biased (Dohm et al. 
2008), we include a discussion concerning the effects of in- 
accurate quality scores, as well as possible strategies to cope 
with the problem. (See the section on the real data appli- 
cation.) 

Estimation of the AFS 

Let p = (p 0 ,. . .,p„) be the AFS in a region of length L 
covered by the reads, that is, pj is the probability of observ- 
ing j copies of the derived allele among the n chromosomes 
at a given genomic position. The likelihood of this 
spectrum given the observed reads is 
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Materials and Methods 

Accounting for Sequencing Errors and 
Chromosome Sampling at One Position 
We consider here a sample of n chromosomes that have 
been subjected to Pool-Seq. We assume an infinite sites 
model and denote by Y„ the number of derived alleles 
at genomic position / (0 < Y, < n). With NGS of pools, 
Yj is unobserved. We observe a collection of r, reads, among 
which the observed number of derived alleles will usually 
differ from Y due to 1) the random sampling of reads from 
the n chromosomes and 2) the sequencing errors. Let Zy 
(0 < j < rj) denote an indicator variable equal to 1, if 
the observed allele at read j is derived, and let Z,= 
(Zj-|, . . . ,Z j/t ). Let furthermore, e,; be the probability for 
a sequencing error at read j. The conditional probability 
of the observed reads Z, given Y, is then equal to 
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In this equation, YJn should be interpreted as the prob- 
ability that read j is taken from the subset of derived alleles 
in the pool. Because the sequencing is performed on one 
single pool, this probability is the same for all reads j. It is 
equal to YJn because we assume that reads are sampled 
uniformly from each of the n chromosomes. Indeed, we 
do not account here for the possible biases arising from 
unequal concentration or quality among individuals, or al- 
lele specific amplification. Note that the influence of un- 
equal concentration or quality among individuals on 
allele frequency estimation are expected to be low for large 
sample sizes, as shown by the derivations of Futschik and 



As this likelihood involves a summation over the unob- 
served variables Y„ we propose to maximize it using an EM 
strategy. Our algorithm starts from an arbitrary initial value 
of p and iteratively computes new values of p that increase 
the current likelihood. If p c is the current value of the AFS, 
the next value is given as 



P(Z / |Y i =j)pf 



(3) 



The EM iterations are thus based on the conditional 
probabilities computed using equation (1). This algorithm 
is similar to the EM-AFS strategy independently proposed 
in Li (2011). More details about the algorithm are provided 
in the Supplementary Material online. 

What we denote by p here is an estimate of the allele 
frequency probabilities in a random sample of size n from 
the population. Inference based on coalescent theory, as 
the derivations of Nielsen et al. (2005) used in our sweep 
detection model, generally involve this quantity. Note, 
however, that the shape of this sample AFS can be expected 
to resemble the shape of the AFS in a population of size 
N. Indeed, it also permits to come up with an estimate 
of the population allele frequency distribution in terms 
of an approximate continuous model. A natural way to es- 
timate the parameters of the continuous model would be 
via maximum likelihood. In a Bayesian context, Gompert 
and Buerkle (2011) use a continuous parametric model to 
come up with a prior distribution for p in their hierarchical 
model. 

Detection of Selective Sweeps 

Since the allele frequency pattern in the vicinity of a se- 
lected allele differs from the one under neutrality, such lo- 
cal variation in allele frequencies can be used to detect past 
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selection events (Kim and Stephan 2002; Nielsen et al. 
2005). To detect selective sweeps from Pool-Seq data, 
we extend the Hidden Markov Model (HMM) approach 
that we initially developed for completely sequenced data 
(Boitard et al. 2009). 

We assume that each site / is associated with a hidden 
state Xj, which can take three different values: "Selection," 
for the sites that are very close to a swept site; "Neutral," for 
the sites that are far away from any swept site; and "Inter- 
mediate," for the sites in between. These three values are 
associated with different frequency spectra (the "Selection" 
spectrum is more skewed toward low and high allele 
frequencies than the "Intermediate" spectrum, and even 
more than the "Neutral" one). The hidden states X; form 
a Markov chain along the genome with a per site proba- 
bility q of switching state, so that close sites tend to be in 
the same hidden state. The observed variables are Z„ 
containing the site frequencies taken from the pooled 
sequence reads. After computing suitable emission proba- 
bilities, the Viterbi algorithm is used to predict the most 
likely hidden states X, from the observed states, and thus 
detect the swept regions. Combining equation (1) with the 
AFS in hidden state X, leads to the emission probabilities 

P(z,|x,)= £p(z,|y,K;. (4) 

Y, = 0 

We prefer, however, to only consider those sites for 
which two different alleles are observed among the reads 
(i.e., where Y%=-\ Z,j>0), and run the HMM using the emis- 
sion probabilities 

These emission probabilities account also for the fact 
that an observed polymorphism could be due to a sequencing 
error. 

Notice that the approach in Boitard et al. (2009) would 
not be applicable here, as it assumes equal coverage at each 
position and also that the true numbers of derived alleles Y, 
are known. 

A natural method for obtaining allele frequency spectra is 
to first estimate the AFS under the state "Neutral" by apply- 
ing the EM algorithm presented above to the whole genome 
data. An approximate AFS for the other hidden states can 
then be obtained by adequately modifying the neutral AFS 
using the method described in Nielsen et al. (2005). 

Simulations 

We used MSMS (Ewing and Hermisson 2010) to simulate 
genomic samples under a coalescent model with mutation, 
recombination, and constant population size. Our consid- 
ered models involved both neutrality and a selective sweep 
at a single locus. From the genomic samples obtained from 
MSMS, pooled NGS data were simulated using our own 
MATLAB code: For each site, a number of reads r t was 
simulated (independently of the other sites) from a Poisson 



Table 1. Selective Sweep Detection Power. 



Sample Size 


n = 25 n 


= 50 n 


= 100 n 


= 200 


Pooled NGS data 


0.91 


0.90 


0.91 


0.90 


Sequence data — all sites 


0.89 


0.88 


0.87 


0.87 


Sequence data — segregating sites 


0.57 


0.60 


0.64 


0.62 



Detection power for pools of n — 25 to n = 200 chromosomes of length L = 100 
kb simulated under a constant population size coalescent model with 0 — 0.003, 
p = 0.003, and a = 500. NGS data sets were simulated with an expected 
coverage, X = 100. 



distribution with expected value X, the coverage of the 
sequencing experiment. For each read j (1 < j < r), we 
then simulated the allele type Z,j from a Bernoulli distribu- 
tion with parameter YJn. (Recall that Y, is the derived allele 
frequency at site /' in the pool.) Next, a sequencing error 
probability e h j was generated by drawing from the empirical 
distribution of PHRED scores, observed in our NGS data set 
(supplementary fig. S3, Supplementary Material online). Fi- 
nally, sequencing errors were introduced by drawing from 
a Bernoulli distribution with parameter e f ;. This leads to a sim- 
ulated NGS sample, which can then be used for the AFS es- 
timation and the detection of selective sweeps. 

For detecting selective sweeps, we adapted the approach 
taken in Boitard et al. (2009) to pooled NGS samples in- 
stead of complete sequence data. When analyzing our sam- 
ple, we identify a selective event as soon as the state 
"Selection" has been predicted for at least one site. For eval- 
uating the detection power under a given selective sce- 
nario, we simulate several samples under this scenario 
(500 in this study, because of the high computational cost 
of the analysis) and look at the percentage of scenarios for 
which a sweep window is detected that also includes the 
true position of the selected site. The output of the analysis 
depends on the switching probability q used in the tran- 
sition matrix of the HMM. In order to calibrate this prob- 
ability, we preliminarily simulated 500 neutral samples 
under the same demographic scenario and analyze them 
with different values of q. We select a value of q such that 
selection is detected in 5% of these neutral samples, which 
means that we have a false positive rate of 5%. For the re- 
sults shown in table 1, the selected value of q was around 
2.10~ 4 , with little variation for different sample sizes. 

For the comparison between pooled NGS data and 
complete sequence data, we applied the HMMs described 
in Boitard et al. (2009). 

Analysis of Drosophila Chromosome X 
We looked for selective sweeps on the X chromosome of 
Drosophila melanogaster using two samples of 97 female 
flies, both taken from the F1 generation derived from 
5,000 flies, which were collected in November 2009 at 
the Kahlenberg, Austria. These flies were adapted to lab 
conditions during 2 days before they reproduced to form 
the F1 generation. Two samples of 97 females were 
subjected to Pool-Seq. 

Genomic DNA was extracted from 97 individuals, which 
were homogenized with a Ultraturrax T10 (IKA-Werke, 
Staufen, Germany) and purified with the Qiagen DNeasy 
Blood and Tissue Kit (Qiagen, Hilden, Germany). Genomic 
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DNA was sheared with a S2 device (Covaris, Inc., Woburn, 
MA) and used to prepare paired-end genomic libraries with 
the Paired-End DNA Sample Preparation Kit (lllumina, 
San Diego, CA) following the manufacturer's instructions. 
Sequencing was performed with an lllumina GAIIx sequencer. 

Reads were trimmed to remove low-quality bases and 
mapped with bwa (version 0.5.7) (Li and Durbin 2009) 
against the D. melanogaster reference genome (version 
5.18) and Wolbachia (NC_002978.6). We used the follow- 
ing mapping parameters: -n 0.01 (error rate), -o 2 (gap 
opening), -d 12 and -e 12 (gap length) disabling the 
seed option. The alignment files were converted to the 
Sequence Alignment/Map (SAM) format using the bwa 
module sampe enabling a local alignment procedure 
(Smith-Waterman), whenever one of the reads of the pair 
could not be mapped with global alignment. The SAM files 
were filtered for reads mapped in proper pairs with a min- 
imum mapping quality of 20 using SAMtools (Li et al. 2009). 
The filtered SAM files were converted into the pileup for- 
mat. We used RepeatMasker 3.2.9 (www.repeatmasker.org) 
to create a gff file to mask simple sequence repeats and 
transposable elements of the D. melanogaster genome ver- 
sion 5.34. Finally, indels together with five flanking nucleo- 
tides (on both sides) were masked in the alignments of 
each population if the indel was present in at least one pop- 
ulation and supported by at least two reads. 

The expected coverage was 100x for sample 1 and 87 x 
for sample 2. 

We also explored recalibration of the read qualities using 
GATK (DePristo et al. 2011) before creating the pileup file. 
This software estimates the sequencing error probabilities 
based on reads from sites that are assumed to be nonpo- 
lymorphic. Consequently, a list of true polymorphic sites is 
needed. We included in this list: 1) the transposable ele- 
ment positions reported by RepeatMasker (see above), 
2) the positions flanking indels (5 bp upstream and down- 
stream), and 3) the positions with more than two copies of 
the minor allele. 

The statistical analysis (both for AFS estimation and for 
genome scans for selection) was based on folded sequence 
data, so we did not require for SNPs the ancestral alleles to 
be known. The allele labels 0 and 1 have thus been chosen 
arbitrarily. We used the folded likelihood 

P / (z,|y,) = ^P(z,|y,) + lp(i - z,|y,). (6) 

Polymorphic sites with three different alleles were also 
used in the analysis. They were converted into SNPs by 
removing the least frequent allele, which we considered 
to be most likely due to a sequencing error. 

For computational reasons, the AFS estimation was 
based on only 10% of the sites from the pileup file. This 
subset was selected at random and included about 2 million 
sites, which was largely sufficient to estimate the AFS in 
a pool of 200 chromosomes (see simulation results). 

An important tuning parameter of the selection scans 
based on our HMM is the transition probability q between 



neutral and selected states. The larger the q, the less evi- 
dence is required for a transition to selection and the more 
sweep candidates will be detected. In our real data appli- 
cation, the transition probabilities were based on the 
genetic locations, which were deduced from physical 
locations using Marey maps (Fiston-Lavier et al. 2010). 
The probability of switching state between two consecutive 
SNPs is then given by qd, where d was the genetic distance 
between the two SNPs. 

To avoid a high rate of false positives, it is important to 
choose a small enough value for q. A natural strategy is to 
simulate sequences under a neutral scenario with realistic 
demography and estimates for mutation and recombina- 
tion. Based on such simulations, q = q sim can be chosen 
such that the probability of falsely detecting selection 
on any segment of a given length is controlled and kept 
below a certain threshold cc, as already explained in the 
"Simulations" section. However, a simulated scenario will 
always involve some simplifications or biases compared 
with the real demography, and the real background sce- 
nario will usually be unknown. To account for this uncer- 
tainty, a conservative approach is to choose a value of q 
lower than q sim . In the extreme case, if the simulated model 
were completely unrealistic, it would actually make sense 
to choose q = 0 so that no false positives will be obtained. 

Even with a good estimation of the population demog- 
raphy, reliable neutral simulations are difficult to design 
and extremely time intensive, because they must account 
for the variation of recombination rate along the whole 
analyzed region (from 1 to 4 cM per Mb in our case, plus 
a large region with no recombination). Besides, in the spe- 
cific case of Pool-Seq, the scaled recombination rate cannot 
be estimated from the data because haplotype information 
is not available. Consequently, we decided to choose q sjm 
using a very simple model and to select a value of q 
considerably lower than q sjm . 

More specifically, we simulated neutral samples with 
length 100 kb under a model with constant population size. 
We chose 0 = 0.003, which is consistent with the AFS es- 
timated from our data, and p = 9 as suggested by the re- 
sults of Haddrill et al. (2005) for non-African populations of 
D. melanogaster. The analysis of these samples using the 
folded AFS likelihood described in equation (6) led to q sjm 
= 4.10~ 5 . With this value, we only detected a sweep signal 
in 1% of the simulated samples with length 100 kb, which 
suggests only one false positive signal every 10 Mb. In 
order to take into account the discussed uncertainties 
about the true model, we decided to work with the value 
q = 10~ 10 which is considerably below that obtained via 
simulations. 

For a more detailed discussion concerning the choice of 
transition probabilities, see Boitard et al. (2009). 

Results 

Accuracy of the AFS Estimation 

In order to investigate the accuracy of our AFS estimation 

procedure, we simulated reads from 100 pools of sequences 
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Fic. 1. AFS estimation. Pools of n = 25 (a) and n = 50 (fa) chromosomes of length L = 100 kb were simulated under a constant population 
size coalescent model with 6 = 0.003 and p = 0.003. Solid lines show the AFS extracted from the complete sequence information and averaged 
over 100 simulated samples (it closely fits the AFS expected from coalescent theory). Diamonds and error bars represent the average estimated 
AFS and the average absolute deviation respectively using the same 100 samples. The estimates were obtained from pooled NCS data with 
100X expected coverage using the EM algorithm. 



under neutrality. We considered four different pool sizes 
(n = 25, 50, 100, and 200), took I = 100 as the expected 
coverage, and L = 100 kb as sequence length. For further 
details, see the section on Materials and Methods. Under 
this setup, we compared the AFS estimated from pooled 
NGS data with our EM algorithm to the AFS computed 
under the assumption that the complete genetic informa- 
tion of the pool were available. As shown in figure 1 for n = 
25 or 50, we found that our estimation procedure was es- 
sentially unbiased and had a small average absolute devi- 
ation. The main difference between the results obtained for 
n = 25 and n = 50 was a slight underestimation of the 
singleton frequency estimated when n = 50. This is likely 
due to the lower per chromosome coverage in this case, 
which implies that it is more difficult to decide whether 
observed singletons are true or result from sequencing 
errors. Results obtained for n = 100 and 200 have been 
very similar to those obtained with n = 50. Overall, our 
simulations show that accurate estimates of the AFS can 
be obtained from NGS data from pools with low per 
chromosome coverage (0.5 x), despite of the sequencing 
errors. We also point out that the accuracy of the estimates 
will increase with sequence length, suggesting a very high 
accuracy when estimating the AFS at a whole genome scale. 
Notice, however, that the simulations were performed 
under the assumption that the probabilities of sequencing 
errors are known. As discussed below, inaccurate or biased 
error probabilities result in biased AFS estimates. 

Selective Sweep Detection Power 
Next, we simulated reads under a selective sweep scenario. 
The simulation parameters were the same as above, except 
that one selected locus with selection intensity a = 2 N s = 
500 was placed in the middle of the 100 kb segment. This 
value of a corresponds to rather weak selection, compared 
with the distribution of selection intensities for sweeps 



identified in D. melanogaster (Li and Stephan 2006). For each 
simulated sample, we detected selection using either com- 
plete sequence data and the method in Boitard et al. (2009) 
or Pool-Seq data and the method presented here. As our new 
method extends that in Boitard et al. (2009), it should be of 
interest to compare the power of the two methods that 
make use of different amounts of information. For the anal- 
ysis of the complete sequence data, we used either all sites or 
only segregating sites. Recall that the lower density of segre- 
gating sites (i.e., the larger probability of allele count 0) in 
swept regions is used as an additional sweep signal when 
using all sites. 

As shown in table 1, the detection power with pooled 
data using only segregating sites was similar to that ob- 
tained with sequence data and all sites. At first sight, it 
might be surprising that the power was even slightly better 
with pooled samples. A closer look reveals, however, that 
the estimated sweep windows were usually slightly larger 
with pooled data than with classical sequencing data, 
and consequently had a higher probability of including 
the true selected site. The slight gain in detection power 
is thus associated with a slight loss in accuracy of localizing 
the sweep. Nevertheless, it is surprising that the results for 
pooled samples were considerably better than those for 
error-free classical separate sequencing when in both cases 
only segregating sites are used. A possible explanation is 
that with NGS sequencing data many singletons are 
sequencing errors at nonpolymorphic sites. Since a high 
proportion of singletons serves as a signal of selection such 
sequencing errors seem to increase the sensitivity of our 
test without causing an excess of false positives. 

Application to Real Data in Drosophila 
Using our new approach for Pool-Seq data, we estimated 
the AFS and searched for selective sweeps on the X chro- 
mosome of an Austrian Drosophila melanogaster 
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population. We analyzed a pool of 97 female flies from this 
population that has been sequenced at 100x coverage (for 
more details, see Materials and Methods). The sweep 
regions found with our scan are listed in table 2. Most 
of these regions were between 10- and 40-kb long, suggest- 
ing that hitchhiking mapping from Pool-Seq data identifies 
narrow intervals containing only a few genes that may have 
undergone recent selective sweeps. A few longer regions 
(up to 400 kb) were detected close to the centromere 
(supplementary fig. SI, Supplementary Material online). 
This is due to the fact that the recombination rate is very 
low close to the centromere, which increases the hitch- 
hiking effect of positive selection. 

Some of the detected regions were already identified by 
previous studies as sweep candidates in Europe. For in- 
stance, region 10 corresponds to the wapl region identified 
in Beisswanger et al. (2006). This region had a very high 
confidence index, within the top 10 of table 2. Interestingly, 
the size of the sweep window inferred by our method is 
similar to the one previously reported (Beisswanger et al. 
2006) (74 vs. 60.5 kb). Region 16, which is located around 
the gene unc-119, was detected in Glinka et al. (2006). 

Apart from these well-characterized regions, we also de- 
tected some narrow sweep windows with a high confi- 
dence score. The high confidence region 14 contains 
only a single gene, Ca-alphalT, which is predicted to en- 
code a Calcium channel (http://flybase.org). Four regions 
encompass only two annotated genes in D. melanogaster. 
One of them, region 28, contains the gene Shaker (Sh), 
which encodes a voltage-dependent potassium channel 
and has been shown to affect sleeping behavior and life- 
span (Cirelli et al. 2005). 

The AFS estimated from our sample had an unusual pat- 
tern, showing a reduced proportion of extreme allele 
counts (fig. 2a). To investigate potential causes, we first 
considered the nonnegligible proportion of tri-allelic SNPs 
(about 1% of all sites). In our analysis, the least frequent 
allele of tri-allelic SNPs has been removed systematically 
(see Materials and Methods). We therefore reestimated 
the AFS after having removed all tri-allelic SNPs, but this 
resulted essentially in the same AFS pattern (data not 
shown). Hence tri-allelic SNPs cannot explain the observed 
deficit in low-frequency alleles. We also studied the influ- 
ence of the coverage per site, by estimating the AFS using 
only positions with a specific coverage, and obtained again 
similar patterns. Varying the initial AFS that is used as start- 
ing point for the EM algorithm also had little influence on 
the finally estimated AFS, even when we started the algo- 
rithm from the AFS of an expanding population, which is 
characterized by an excess of small minor allele counts. We 
observed different estimated AFS patterns, however, when 
we restricted the analysis to base calls characterized by 
a specific range of PHRED scores. In particular, the restric- 
tion to base calls with PHRED score greater than 35 resulted 
in an estimated AFS with no deficit in extreme allele counts 
(fig. 2b), which is a reasonable neutral background AFS. 
Computer simulations show that the estimation bias ob- 
served when using all base calls is not caused by the 



low quality of many base calls in itself, but rather arises 
from a discrepancy between the PHRED scores provided 
by lllumina and the exact sequencing error probabilities. 
This observation is consistent with previous results (Dohm 
et al. 2008), showing that lllumina scores tend to be too 
pessimistic. The resulting overestimated probabilities for 
sequencing errors affected in particular those sites with 
low minor allele frequencies. 

As an alternative to filtering with respect to quality 
scores, we recalibrated the quality scores using GATK 
(DePristo et al. 2011) before estimating the AFS. However, 
this correction had little effect on the AFS pattern. A reason 
for this might be that GATK uses monomorphic positions 
for the recalibration. Since we provided a list of SNPs with 
at least two copies of the minor allele in our sample, the 
remainder of the sequence contained singletons, which 
were a mixture of sequencing errors and true singletons. 
Our failure to distinguish true polymorphism from 
sequencing errors may have negatively affected our efforts 
to recalibrate the quality scores. 

Since base calls with PHRED score greater than 35 pro- 
vide a more reliable estimate of the AFS, we also performed 
a scan for selection using only these base calls, and com- 
pared the results with those obtained using all base calls. 
The signals obtained with the two strategies were generally 
consistent (fig. 3). Among the 32 sweeps detected with all 
base calls, 24 were confirmed using only high-quality base 
calls. The proportion of sweeps detected with both strat- 
egies increased with the confidence index. Among the 15 
sweeps detected using all base calls with a confidence index 
greater than 20, 13 were confirmed using only high-quality 
base calls. 

In order to see whether the sweep windows that were 
not confirmed when only using high-quality base calls are 
false positives or rather false negatives, we sequenced (at 
87x coverage) a further independent sample of 97 flies 
from the same population. Due to the random sampling 
of different flies in the two pools and to the random differ- 
ences of coverage and base quality scores along the ge- 
nome inherent to NGS technology, we do not expect 
to find the same false positives in the two samples. 
The new sample provided a very similar estimated AFS 
(not shown), which suggests that no major experimental 
problem affected either of the samples. Furthermore, 
most sweep windows detected using all base calls were 
detected again using the second sample (29 over 32, 
see also supplementary fig. S2, Supplementary Material 
online). In particular, 7 of the 8 sweep windows that were 
not confirmed using only high-quality base calls were de- 
tected using the second sample, and can thus be seen as 
false negatives in the analysis focusing on high-quality 
base calls. This suggests that sweep detection based on 
all sites is more reliable than expected given the pro- 
nounced bias in global AFS estimation. A possible expla- 
nation for this consistency is that the bias in AFS 
estimation is homogeneous along the genome and does 
not affect the detection of true local variation in the AFS 
along the genome. 
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Table 2. Selective Sweeps Detected on Chromosome X in Drosophila melanogaster. 



Region 


Start' 


End 3 


Length 3 


Cl b 


Genes within the Window 


1 


19 


460 


441 


Inf 


CC17636, RhoGAPIA, CC17707, SP71, CC3038, CG2995, cin, CC13377 
CC13376, ewg, CC3777, CC13375, CC12470, Or1a, CC32816, y, ac, sc 
l(1)sc, pel, ase, Cyp4g1, Exp6, CG13373, CC18275, CG32817, CG18166 
CC3176, CG 18273, CG3156, CGI 7896, CG 17778, svr, arg, elav, CG4293, Appl 


2 


530 


669 


139 


Inf 


su(s), CG13367, Roda, Suv4-20, skpA, sdk, CG13362, CG13361, CG5254 
CG5273, RpL22, fz3 


3 


1,046 


1,144 


98 


Inf 


elF4E-7, CG34320, CG11378, CG11384, CG11379, CG14627, CG14626 
CG11380, CG14625, CG11381, CG14624, CG11382, CG11398, CG3638 
CG11403, A3-3 


4 


1,179 


1,312 


133 


Inf 


CG32812, DAAM, CG18091, fs(1)N, CG11409, CG11412, CG11418, Tsp2A 
CG12773, CG11417, png, CG14770, CG3056, SNF1A, CG3719, CG32813 
CG11448, futsch 


5 


1,338 


1,369 


31 


6.8 


futsch, Gr2a, CG14785, CG14786, CG14787, 1(1)G0431, 0-fut2, CG14777 
CG32808, CG 14778, pek, CG 14780, Rab27 


6 


1,373 


1,408 


35 


33.7 


CG14782, sta, Nmdar2, CG14795, CG32810 


7 


1,456 


1,484 


28 


5.7 


no gene 


8 


1,658 


1,693 


35 


28.1 


Adar, CG32806 


9 


1,728 


1,809 


81 


33.8 


CG14801, CG14812, deltaCOP, CG14814, MED18, CG14815, CG14803 
CG14816, CG14804, CG14817, CG14805, CG14818, CG14806, trr 
mRpL16, arm, CG32803, CG32801, Edeml, mip130, CG17766 


10 


1,995 


2,069 


74 


33.1 


csw, ph-d, ph-p, CG3835, Pgd, bcn92, wapl, Cyp4d1, CG3630, CG3621 
Cyp4d14 


11 


2,092 


2,118 


26 


19.8 


Mct1, CG18031, msta, Vine, CG14052 


12 


3,662 


3,681 


19 


12.7 


Tlk 


13 


5,766 


5,784 


18 


7.9 


CG3033, mof, CG3016, CG16721 


14 


6,023 


6,061 


38 


30.6 


Ca-alpha1T 


15 


7,028 


7,054 


26 


27.7 


no gene 


16 


7,152 


7,191 


39 


14.4 


CG1958, CG1677, CG2059, unc-119 


17 


7,336 


7,419 


83 


32.4 


CG11368, CG32719 


18 


7,821 


7,848 


27 


31.1 


CG10777, CG10778, RpS14a, RpS14b, CG1530, l(1)G0193, CG1531, CG15332 


19 


10,358 


10,383 


25 


16.3 


CG17255, CG2889, CG2887, PPP4R2r, CG32687 


20 


11,371 


11,407 


36 


31.3 


Cyp4g15, CG1749, Spase25, CG33235, CG32666 


21 


11,441 


11,499 


58 


32.4 


CG32666, CG1572, PGRP-SA, Rpll215, CG11699, l(1)G0237, CG11697 
CG11696, e(y)2, CG11695, nod, CG1561, rho-4, CG2533 


22 


11,868 


11,893 


26 


15 


cac, gd, tsg, CGI 81 30, fw 


23 


1 3,098 


13,123 


25 


15 


sno, REG, mew 


24 


14,937 


14,953 


16 


6.6 


hiw, CG5541 


25 


1 5,696 


15,716 


20 


14.7 


PGRP-LE, sd, CG8509 


26 


15,824 


15,846 


22 


16.5 


Ranbp16, Stim, CG8924, CG8928, CG15603, CG15604 


27 


17,743 


17,764 


21 


17.1 


CG15814, CG6506, CG32554, CG32557, CG6762, Arp8, CG6769, mnb 


28 


17,924 


17,956 


32 


32.2 


Sh, CGI 5373 


29 


18,539 


18,559 


20 


13.2 


l(1)G0003, CG6540, CG6617, Ing3, CG6659, fu, CG6696 


30 


19,455 


19,479 


24 


17.5 


Grip84, car, Tao-1, CG14218, CG14204 


31 


20,978 


21,009 


31 


19.6 


CG11566, stgl, unc, CG15445, CG34120 


32 


21,234 


21,266 


32 


15.5 


waw, bbx, slgA, HIc, mst 



In kilobases, along the X chromosome. 
1 Confidence Index: Maximum of — log(1— q f ) over the window, where q, is the posterior probability of hidden state "Selection." 



Discussion 

Our aim has been to provide a new statistical method for 
estimating the AFS and detecting selective sweeps that can 
be used with experimental setups where a sample of indi- 
viduals is sequenced in a single pool. As argued in Futschik 
and Schlotterer (2010), this experimental design is a cost- 
effective alternative to sequencing of individuals for pop- 
ulation genetic analysis based on allele frequencies. Often 
fairly large samples are sequenced at low individual cover- 
age using this approach. The analysis of NGS data from 
pools leads to new challenges, and existing methods for 
classical sequencing cannot directly be applied. Obviously, 
the per site coverage should be taken into account, and 
sites with high coverage should be more influential than 
sites with low coverage. Also, sampling of reads from 



the pool leads to an additional level of randomness that 
needs to be considered. 

A major methodological challenge for the analysis of 
NCS data at low coverage arises from sequencing errors, 
because such designs do not provide enough redundancy 
to distinguish sequencing errors reliably from true low- 
frequency variants. So far, most theoretical studies on 
the subject, for both individual sequencing and Pool- 
Seq, have considered a simple approach where sites with 
minor allele count/frequency below a given threshold 
are omitted (Achaz 2009; Jiang et al. 2009; Futschik and 
Schlotterer 2010; Lee et al. 2011). This strategy is also cur- 
rently popular for population genetic studies based on 
Pool-Seq data, see for instance Rubin et al. (2010). In con- 
trast, our method uses all sites, but accounts for the 
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-Q 
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1/194 25/194 49/194 73/194 

derived allele frequency 



97/194 1/194 25/194 49/194 73/194 

derived allele frequency 



97/194 



Fic. 2. AFS in Drosophila melanogaster. Estimated from all base calls (a) or only those with PHRED score greater than 35 (fa). As we consider the 
folded AFS, the probabilities for allele frequencies 98/194 to 193/194 (not shown) can be deduced by symmetry from those for allele 
frequencies 1/194 to 96/194. 



probability that a base call arises from a sequencing error. In 
principle, sequencing error probabilities could be deduced 
from the quality scores provided by the sequencing ma- 
chine. It is known, however, that the lllumina PHRED 
scores, for instance, are biased. We discuss this point below. 

First, we applied our method to simulated data, where 
we assumed accurate sequencing error probabilities. The 
obtained estimates of the AFS in pools from n = 25 
to n = 200 chromosomes using Pool-Seq data at 100x 
expected coverage were not biased and highly accurate 



(fig. 1). This implies for instance that the frequency of sin- 
gletons in a pool of 200 chromosomes can be reliably es- 
timated using pooled NCS data at this coverage, despite of 
sequencing errors. We then evaluated, again for n from 25 
to 200 and lOOx expected coverage, the power of detect- 
ing a selective sweep event using pooled data. Our method 
provided very similar levels of power to that for individual 
sequencing of the entire pool (table 1). These promising 
results for sweep detection indicate that Pool-Seq data pro- 
vide a rich source of information and may be suitable for 
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Fic. 3. Selective sweeps detected on the X chromosome of D. melanogaster. We used either all base calls or base calls with PHRED score greater 
than 35. The x axis labels permit to read off the physical position of the sweep window (in kilobases). 



2184 



Detecting Selective Sweeps from Pooled NCS Samples • doi:10.1093/molbev/mss090 



MBE 



the inference of demographic scenarios such as population 
bottlenecks or expansions. 

For applications to real data, the issue of inaccurate 
PHRED scores needs to be addressed. Unfortunately, no re- 
liable approach on how to deal with biased quality scores in 
the context of Pool-Seq has been described so far. Although 
several models dealing with Pool-Seq data and including 
sequencing error probabilities have been proposed for 
SNP selection (Bansal 2010; Li 2011; Wei et al. 2011) and 
population genetic parameter estimation (Li 2011), only 
a few (Bansal 2010; Li 2011) take advantage of PHRED 
scores to determine these sequencing error probabilities. 
Nevertheless, they did not evaluate the influence of this 
strategy in the context of real reads and quality scores. 
When analyzing two Pool-Seq samples of D. melanogaster, 
we obtained underestimates of the probabilities of extreme 
allele counts. To spot potential biases in the estimates of 
sequencing error probabilities, we proposed to obtain re- 
peated estimates of the AFS, by using base calls with dif- 
ferent ranges of attached PHRED scores. If the estimated 
AFS are different and given a sufficient amount of reads 
for each individual estimate, this suggests that at least some 
of the obtained estimates are biased. Our results also indi- 
cate that recalibrating the PHRED scores using GATK or 
other similar software can be difficult, if the populations 
involved have not been extensively studied so that a large 
proportion of the SNPs in the genome is already known. For 
nonmodel organisms, another possible strategy might be to 
sequence individually a small number of individuals at high 
coverage in order to recalibrate the quality scores, and 
a large pool of individuals at low coverage for further anal- 
ysis. Alternatively, one could include a known SNP-free 
DNA fragment in all sequencing runs and evaluate the se- 
quencing error probabilities using this fragment, as done in 
Druley et al. (2009). 

Fortunately, our sweep detection method turned out to 
be relatively insensitive to incorrect error probabilities. In- 
deed, we identified 32 selective sweep signatures, most of 
which were confirmed when using only high-quality base 
calls (PHRED score more than 35) and when analyzing an 
additional sample from the same population. One of the 
regions with the strongest evidence for selection was the 
wapl region, which was already identified as a sweep region 
in Europe (Beisswanger et al. 2006). A natural question is 
whether the signals our HMM is looking for, might have 
been caused by phenomena other than selective sweeps. 
One possibility are local fluctuations in the mutation pa- 
rameter that may arise for instance from variable levels of 
purifying selection among codon positions or coding/non- 
coding sequences. Although we do not take the density of 
segregating sites as a signal by itself, sequencing errors will 
lead to an increase in the proportion of sites with low num- 
bers of derived alleles in windows where 6 is small, as ob- 
served in our simulations. This is due to the fact that the 
classification between sequencing errors and correct reads 
is not perfect. Notice, however, that the effect on the AFS 
will be small, when only very high quality reads are used. For 
our data analysis, it is therefore reassuring that most of our 



sweep signals were confirmed when using only the high- 
quality reads. If we assume that local stretches of sequence 
where the mutation rate is reduced tend to be short, an- 
other argument for the limited influence of sequencing er- 
rors would be that the sweep windows we detected on 
chromosome X of D. melanogaster tended to be fairly large. 

If there is uncertainty about the homogeneity of the mu- 
tation rate at a larger scale, sweep detection (but not AFS 
estimation) can also be based on sites with at least k ob- 
served minor alleles, for instance with k = 2 or k = 3. Our 
method can easily be adapted for this purpose by replacing 

p ( Ef-i z .j=o) by p( z 'j <k ^ in e q uation (s). Note - 

however, that the computation time will increase. Indeed, 
while there is only one vector Z; verifying ^^JL Z,- j = 0, 
there are r l !/l!(r, — I)! vectors verifying J3|L 1 Zy=l and 
the likelihood of all these vectors needs to be computed 
for / from 0 to k-1. 

Like several other methods for the detection of selection, 
our approach is designed for hard sweeps with the favor- 
able allele being fixed recently. Partial selective sweeps, as 
well as soft sweeps, will therefore usually not be detected. 
On the other hand, it is well known that some demo- 
graphic effects, in particular bottlenecks, can produce sim- 
ilar genomic patterns as selective sweeps, potentially 
leading to false positives. In a previous study focusing 
on standard sequencing data (Boitard et al. 2009), we sim- 
ulated a wide range of bottleneck scenarios and showed 
that the HMM method proposed for individual sequencing 
generally led to fewer false positive signals than several 
competing methods. The reason is that HMMs do not only 
use the site frequency spectrum but take into account also 
the correlation of allele frequencies between sites. As bot- 
tlenecks tend to increase the correlation between sites, we 
expect also the Hidden Markov Model proposed here to be 
more robust against bottlenecks than for instance compos- 
ite likelihood methods. To put us further on the safe side, 
the sweeps detected in D. melanogaster were identified us- 
ing the HMM with very conservative tuning parameters 
(see Materials and Methods). 

Overall, our study shows that sequencing large pools of 
individuals at low coverage is a promising strategy for pop- 
ulation genetic analyzes. Indeed, the method we proposed 
permits for cost effective and powerful scans for selection 
using this type of data. Its practical applicability is dem- 
onstrated by the selective sweep signals we identified 
in D. melanogaster. Alternative cost effective sequencing 
strategies, such as Restriction site Associated DNA se- 
quencing (Hohenlohe et al. 2010) and Genotyping-by- 
Sequencing (Andolfatto et al. 2011; Elshire et al. 2011), have 
been proposed for population genetic studies based on 
large samples. These molecular methods generate individ- 
ual low-coverage sequence data for a subset of the genome, 
thus providing individual genotypes at a large number of 
SNPs (typically from tens to hundreds of thousands). This 
represents a clear advantage over Pool-Seq for applications 
requiring haplotype information. However, the estimation 
of individual genotypes requires a minimum per individual 
coverage, at the very least 3x for calling homozygotes and 
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5x for calling heterozygotes (Hohenlohe et at 2010). In 
contrast, our study demonstrates that a per individual cov- 
erage around 1 x is sufficient in a Pool-Seq analysis. Of 
course, sequencing individuals at 1 x or 2x coverage is also 
a reasonable strategy for allele or haplotype frequency es- 
timation, provided the uncertainty about individual geno- 
type calls is taken into account in the analyzes (Gompert 
et al. 2012). Although this experimental design was shown 
to be less efficient than Pool-Seq for allele frequency esti- 
mation (Futschik and Schlotterer 2010), it provides partial 
information about individual genotypes. Another advan- 
tage of Pool-Seq over Genotyping-by-Sequencing is to ex- 
plore the whole genome rather than a subset of positions. 
In populations with low levels of linkage disequilibrium, an 
(almost) exhaustive screening of the genome certainly in- 
creases the power of scans for selection or association stud- 
ies. For inference based on allele frequencies only, such as 
our method of detecting hard sweeps, we therefore believe 
that Pool-Seq is an attractive design. 

Supplementary Material 

Supplementary material and figures S1-S3 are available 
at Molecular Biology and Evolution online (http://www. 
mbe.oxfordjournals.org/). 
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